Towards Automated Design, Analysis and Optimization of Declarative Curation Workflows

Authors

  • Tianhong Song
  • Sven Köhler
  • Bertram Ludäscher
  • James Hanken
  • Maureen Kelly
  • David Lowery
  • James A. Macklin
  • Paul J. Morris
  • Robert A. Morris

Abstract

Data curation is increasingly important. Our previous work on a Kepler curation package has demonstrated the advantages of automating data curation pipelines with workflow systems. However, manually designed curation workflows can be error-prone and inefficient due to a lack of user understanding of the workflow system, misuse of actors, or human error. Correcting problematic workflows is often very time-consuming. A more proactive workflow system can help users avoid such pitfalls. For example, static analysis before execution can be used to detect potential problems in a workflow and help the user improve the workflow design. In this paper, we propose a declarative workflow approach that supports semi-automated workflow design, analysis and optimization. We show how the workflow design engine helps users construct data curation workflows, how the workflow analysis engine detects different design problems in workflows, and how workflows can be optimized by exploiting parallelism.

Received 13 January 2014 | Accepted 26 February 2014

Correspondence should be addressed to Tianhong Song, Department of Computer Science, University of California, Davis. Email: [email protected]

An earlier version of this paper was presented at the 9th International Digital Curation Conference.

The International Journal of Digital Curation is an international journal committed to scholarly excellence and dedicated to the advancement of digital curation across a wide range of sectors. The IJDC is published by the University of Edinburgh on behalf of the Digital Curation Centre. ISSN: 1746-8256. URL: http://www.ijdc.net/

Copyright rests with the authors. This work is released under a Creative Commons Attribution (UK) Licence, version 2.0. For details please see http://creativecommons.org/licenses/by/2.0/uk/

International Journal of Digital Curation 2014, Vol. 9, Iss. 2, 111–122. DOI: 10.2218/ijdc.v9i2.337 (http://dx.doi.org/10.2218/ijdc.v9i2.337)

Introduction and Motivation

Data curation is critical in many areas, such as the production and use of scientific data collections and repositories. For example, natural science collections data can often be rife with errors and inconsistencies, and reuse of collections data to address scientific questions raises concerns about data quality and fitness for use for the collections management community.

We have continued our development of Kuration 1.0 (Dou et al., 2011), a software package and prototype for automating data curation pipelines with the Kepler scientific workflow system (Ludäscher et al., 2006). Several curation tools and services are integrated into this package as actors, enabling the construction of workflows to perform and document various data curation tasks. The typical structure of a data curation workflow includes an input actor that reads a dataset from a remote or local source, a number of data curation actors that implement different data validation methods, and an output actor that writes the result into a file or a database.

Figure 1. Kuration 1.0: A Kepler/COMAD data curation workflow for collection-oriented data quality control.

Figure 1 shows a data curation workflow in the Kepler workflow system (Dou et al., 2012) developed using the COMAD workflow model (McPhillips et al., 2009). Boxes are actors (also known as processors, steps or modules) and the arrows connecting them are data channels indicating the data flow between actors. COMAD and the related data assembly line approach (Zinn et al., 2009a) are an improvement over the earlier, "conventional" way of designing workflows in Kepler, as they simplify the workflow design.

Practical experience with our first Kuration prototype, while promising (Dou et al., 2012), also revealed a number of technical challenges: 1) the use of remote services on large data collections, together with an unoptimized workflow execution model, creates scalability problems; and 2) the design, configuration and maintenance of workflows are not only time-consuming but also error-prone, e.g., the workflow design itself might have problems, such as incorrectly ordered actors in a workflow (see Q3 below).

In order to tackle these and related issues, we propose a visionary system that (semi-)automatically detects workflow design problems and optimizes workflow design. In the first phase, the design engine selects actors from a library of existing actors based on the user's requirements and puts them together into a "workflow story" (i.e., a sequential order of actors). Alternatively, users can provide their own workflow. In the second phase, static analysis techniques are applied before runtime to anticipate how a workflow might behave during runtime. Using the workflow graph and actor configurations, the corresponding data dependency information can be captured, which in turn can be used to identify possible design problems. In the third phase, the candidate designs are further improved or optimized (e.g., by exploiting parallelism) in order to achieve better performance. The flowchart in Figure 2 shows how our system works.

Figure 2. Overview of the proposed workflow design, analysis and optimization system.

The remainder of this paper is organized as follows: first, we introduce our workflow model; next, we describe detailed design, analysis and optimization techniques and scenarios with examples; finally, we conclude with additional implementation details and a discussion of future work.
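
To make the analysis and optimization phases described above more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each actor declares the data fields it reads and writes; the Actor class, the functions find_order_problems and parallel_stages, and all actor and field names are hypothetical and chosen only for illustration. The first function flags actors placed before the data they need exists (an ordering problem in a sequential "workflow story"); the second groups actors with no mutual data dependency into stages that could run in parallel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Actor:
    """A workflow step with declared inputs and outputs (hypothetical model)."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def find_order_problems(story, initial_fields=()):
    """Static check: report actors that read a field no earlier actor provides."""
    available = set(initial_fields)
    problems = []
    for actor in story:
        missing = actor.reads - available
        if missing:
            problems.append((actor.name, sorted(missing)))
        available |= actor.writes
    return problems


def conflicts(earlier, later):
    """True if 'later' must stay after 'earlier' because of shared data."""
    return bool(
        earlier.writes & later.reads      # read-after-write
        or earlier.writes & later.writes  # write-after-write
        or earlier.reads & later.writes   # write-after-read
    )


def parallel_stages(story):
    """Group actors into stages; actors within one stage do not conflict."""
    level = {}
    for i, actor in enumerate(story):
        level[actor.name] = 1 + max(
            (level[prev.name] for prev in story[:i] if conflicts(prev, actor)),
            default=-1,
        )
    stages = [[] for _ in range(max(level.values(), default=-1) + 1)]
    for actor in story:
        stages[level[actor.name]].append(actor.name)
    return stages


# Hypothetical actors: one input actor, two validators, one output actor.
reader = Actor("reader", writes=frozenset({"name", "date"}))
name_validator = Actor("name_validator", frozenset({"name"}), frozenset({"name_status"}))
date_validator = Actor("date_validator", frozenset({"date"}), frozenset({"date_status"}))
writer = Actor("writer", reads=frozenset({"name_status", "date_status"}))

good_story = [reader, name_validator, date_validator, writer]
bad_story = [reader, writer, name_validator, date_validator]  # writer placed too early

good_problems = find_order_problems(good_story)
bad_problems = find_order_problems(bad_story)
stages = parallel_stages(good_story)
```

In this sketch the correctly ordered story yields no problems, the misordered one reports that the writer reads fields that have not been produced yet, and the two validators end up in the same stage because neither touches the other's fields. A real system would derive the read and write sets from the workflow graph and actor configurations rather than from hand-written declarations.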

Similar resources

Towards the Preservation of Scientific Workflows

Some of the shared digital artefacts of digital research are executable in the sense that they describe an automated process which generates results. One example is the computational scientific workflow which is used to conduct automated data analysis, predictions and validations. We describe preservation challenges of scientific workflows, and suggest a framework to discuss the reproducibility...

Towards a Structured Workflow Language for Model Management

In Model Driven Engineering (MDE), models and mappings play a key role in system design. However, in practice, models and mappings do not exist in isolation, but are combined to form systems of interrelated models. We call the trace of operations, such as model transformations or model merges, between an initial configuration of a system of interrelated models to a final one, a workflow. Curren...

Dynamic configuration and collaborative scheduling in supply chains based on scalable multi-agent architecture

Due to diversified and frequently changing demands from customers, technological advances and global competition, manufacturers rely on collaboration with their business partners to share costs, risks and expertise. How to take advantage of advancement of technologies to effectively support operations and create competitive advantage is critical for manufacturers to survive. To respond to these...

Towards Automated Performance Optimization of BPMN Business Processes

Business Process Model and Notation (BPMN) provides a standard for the design of business processes. It focuses on bridging the gap between the analysis and the technical perspectives, and aims to deliver process automation. The aim of this technical report is to complement this effort by transferring knowledge from the related field of data-centric workflows aiming to provide automated perform...

Fixture Design Automation and Optimization Techniques: Review and Future Trends

Fixture design is a crucial part of the manufacturing process. Fixture design is a critical design activity, in which automation plays an integral role in linking computer-aided design (CAD) and computer-aided manufacturing (CAM). This paper presents a literature review of computer-aided fixture design (CAFD) in terms of automation and optimization techniques over the past decades. First, the...

Journal title:
  • IJDC

Volume 9, Issue 2

Pages 111–122

Publication date 2014